Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature: simplify health/readiness checks #79

Merged
merged 2 commits into from
Sep 16, 2024

Conversation

rustatian
Copy link
Member

@rustatian rustatian commented Sep 15, 2024

Reason for This PR

closes: roadrunner-server/roadrunner#1998

Description of Changes

  • Support health/readiness checks w/o specifying plugins list.
  • TODO: update docs.
  • Strict response schema (even on 5xx status codes), array:
[
    {
        "plugin_name": "string",
        "error_message": "string",
        "status_code": "int"
    }
    {
    ...
    }
]

For the /jobs endpoint, array:

[
    {
        "pipeline": "string",
        "priority": "int",
        "ready": "bool"
        "queue": "string",
        "active": "int",
        "delayed": "int"
        "reserved": "int",
        "driver": "string",
        "error_message": "string"
    }
    {
    ...
    }
]

License Acceptance

By submitting this pull request, I confirm that my contribution is made under the terms of the MIT license.

PR Checklist

[Author TODO: Meet these criteria.]
[Reviewer TODO: Verify that these criteria are met. Request changes if not]

  • All commits in this PR are signed (git commit -s).
  • The reason for this PR is clearly provided (issue no. or explanation).
  • The description of changes is clear and encompassing.
  • Any required documentation changes (code and docs) are included in this PR.
  • Any user-facing changes are mentioned in CHANGELOG.md.
  • All added/changed functionality is tested.

Summary by CodeRabbit

  • New Features

    • Introduced new data structures for reporting plugin and job statuses, enhancing observability.
    • Improved health check and readiness handler to provide structured JSON responses with detailed error reporting.
    • Increased polling capability by modifying configuration for concurrent job processing.
  • Bug Fixes

    • Enhanced error handling in health checks and job reporting to improve clarity and robustness.
  • Refactor

    • Replaced string-based job state reporting with structured data handling for better maintainability.
  • Tests

    • Added new test cases for validating health and readiness of plugins, improving test reliability and clarity.

@rustatian rustatian added the enhancement New feature or request label Sep 15, 2024
@rustatian rustatian requested a review from wolfy-j September 15, 2024 21:26
@rustatian rustatian self-assigned this Sep 15, 2024
Copy link

coderabbitai bot commented Sep 15, 2024

Walkthrough

The changes in this pull request involve modifications to several files related to the health check and job reporting functionalities of a Go application. Key updates include the introduction of new data structures for reporting plugin and job statuses, enhancements to the health check mechanism for improved observability, and adjustments to configuration files to support better job management. Additionally, some linters were removed from the configuration, and tests were updated to improve reliability and clarity.

Changes

Files Change Summary
.golangci.yml Removed the gocognit and gocyclo linters from the configuration file.
api.go, jobs.go Added new data structures: Report and JobsReport in api.go, and refactored job reporting in jobs.go to use JobsReport.
health.go, ready.go Enhanced health check functionality in health.go with structured reporting and improved error handling. Updated ready.go to provide detailed JSON responses for plugin readiness.
plugin.go Removed the template string constant related to plugin status formatting.
tests/configs/.rr-jobs-status.yaml Increased the number of pollers from 1 to 2 in the job status configuration file.
tests/configs/.rr-status-503.yaml Added a new section for job management, specifying parameters for job processing including num_pollers, pipeline_size, and worker pool settings.
tests/plugin_test.go Introduced new test cases for health and readiness checks, updated existing tests for structured assertions, and standardized error handling across tests.

Assessment against linked issues

Objective Addressed Explanation
Define a single source for determining the health of the entire application server (#1998)
Simplify health check requests to not require explicit plugin listing (#1998) The changes improve reporting but do not fully implement the proposal for a simpler health check.

Possibly related PRs

Poem

🐇 In the meadow where plugins play,
Health checks dance in a bright array.
With reports so clear, and jobs in line,
Our server's health will surely shine!
No more lost plugins in the fray,
Hooray for changes that save the day! 🌼


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

Share
Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    -- I pushed a fix in commit <commit_id>, please review it.
    -- Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    -- @coderabbitai generate unit testing code for this file.
    -- @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    -- @coderabbitai generate interesting stats about this repository and render them as a table.
    -- @coderabbitai read src/utils.ts and generate unit testing code.
    -- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    -- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Outside diff range and nitpick comments (2)
tests/configs/.rr-jobs-status.yaml (1)

14-14: Looks good! Consider monitoring performance and resource utilization post-deployment.

Increasing num_pollers from 1 to 2 can potentially improve the system's ability to process jobs concurrently, leading to better performance and reduced processing times.

However, it's important to monitor the system's resource utilization (CPU, memory) and overall performance after deploying this change to production. If the increased concurrency leads to resource contention or degraded performance, you may need to adjust the num_pollers value or scale the system's resources accordingly.

Consider collecting metrics and logs related to job processing, queue lengths, and resource utilization. Analyze this data to determine if further tuning of the num_pollers parameter is necessary to achieve the desired performance characteristics while ensuring the system remains stable and responsive.

ready.go (1)

36-37: Consider initializing the report slice with a more appropriate capacity.

The current initialization of the report slice with a capacity of 2 seems arbitrary and may not be sufficient for all scenarios. Consider initializing the slice with a capacity equal to the number of plugins in the statusRegistry to avoid unnecessary allocations and improve performance.

-report := make([]*Report, 0, 2)
+report := make([]*Report, 0, len(rd.statusRegistry))
Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 1b9c916 and 2d97bc0.

Files ignored due to path filters (2)
  • go.work.sum is excluded by !**/*.sum
  • tests/go.sum is excluded by !**/*.sum
Files selected for processing (9)
  • .golangci.yml (0 hunks)
  • api.go (1 hunks)
  • health.go (2 hunks)
  • jobs.go (2 hunks)
  • plugin.go (0 hunks)
  • ready.go (2 hunks)
  • tests/configs/.rr-jobs-status.yaml (1 hunks)
  • tests/configs/.rr-status-503.yaml (2 hunks)
  • tests/plugin_test.go (14 hunks)
Files not reviewed due to no reviewable changes (2)
  • .golangci.yml
  • plugin.go
Additional comments not posted (17)
tests/configs/.rr-status-503.yaml (2)

14-20: LGTM! Consider adjusting the settings for production deployment.

The introduction of the jobs section with parameters for job management aligns with the PR objective of enhancing job processing capabilities. The current settings seem suitable for testing or low-traffic scenarios.

For production deployment, consider the following:

  • Adjust num_pollers and num_workers based on the expected workload and system resources to ensure optimal performance and scalability.
  • Fine-tune the pipeline_size value based on the specific use case and available system resources to strike a balance between throughput and resource utilization.
  • Monitor the allocate_timeout and destroy_timeout values in production and adjust them if necessary to ensure efficient resource management without causing resource contention or starvation.

32-32: Looks good!

The grace_period value of 10 seconds seems appropriate for allowing the server to shut down gracefully. It provides sufficient time for the server to complete any ongoing tasks and release resources gracefully before terminating.

api.go (2)

3-7: LGTM!

The Report struct is well-defined and aligns with the intended usage. The fields are appropriately named and tagged for JSON serialization.


9-19: LGTM!

The JobsReport struct is well-defined and captures essential job-related metrics. The fields are appropriately named and tagged for JSON serialization. This struct will be valuable for monitoring and debugging job performance and health.

jobs.go (2)

49-60: LGTM!

The code correctly populates the report slice with JobsReport instances created from the jobStates elements. The field mappings are accurate, and the logic is sound.


62-66: LGTM!

The code correctly marshals the report slice into JSON format and handles any errors that may occur during the process. The error handling ensures that any issues are logged appropriately.

health.go (5)

4-4: LGTM!

The "encoding/json" package import is necessary for JSON serialization and deserialization in the ServeHTTP method.


33-34: LGTM!

The report slice is initialized with a capacity of 2, which is a good practice to avoid unnecessary allocations. The comment provides clarity on the purpose of the report slice.


37-106: Comprehensive health check implementation!

The code block handles the case when no plugins are provided in the query. It performs a comprehensive health check by iterating over all registered plugins and checking their status. The detailed reports appended to the report slice provide valuable information about each plugin's health status, including error messages and status codes.

The code handles various scenarios such as uninitialized plugins, plugins returning nil, and unexpected status codes. The response is properly serialized to JSON and written to the response writer.

Overall, this code block enhances the health check functionality by providing a comprehensive overview of all plugins' health status when no specific plugins are requested.


108-160: Targeted health check implementation!

The code block iterates over all provided plugins in the query and performs a targeted health check. It checks the status of each plugin and appends detailed reports to the report slice, including error messages and status codes.

The code handles various scenarios such as uninitialized plugins, plugins returning nil, and unexpected status codes. Setting the response status code to unavailableStatusCode for status codes >= 500 is appropriate.

Additionally, the code logs information when a plugin does not support health checks, providing visibility into unsupported plugins.

Overall, this code block enhances the health check functionality by allowing targeted health checks for specific plugins provided in the query.


161-168: LGTM!

The report slice is serialized to JSON, and the serialized JSON response is written to the response writer, which is the correct way to send the response.

Any error during JSON marshaling is logged, which is important for debugging purposes.

ready.go (4)

40-96: LGTM!

The code changes for handling the readiness checks of all plugins when no specific plugins are provided in the query parameters look good. The code flow, error handling, and logging seem appropriate. The report slice is being populated correctly based on the readiness status of each plugin.


111-159: LGTM!

The code changes for handling the readiness checks of the provided plugins in the query parameters look good. The code flow, error handling, and HTTP status code setting seem appropriate. The report slice is being populated correctly based on the readiness status of each provided plugin.


160-162: LGTM!

The code correctly logs an informational message when a provided plugin does not support readiness checks and continues execution without adding any report for the unsupported plugin.


169-170: LGTM!

The code correctly writes the JSON response to the ResponseWriter. Although the error returned by w.Write is being ignored, it is a common practice in HTTP handlers.

tests/plugin_test.go (2)

661-680: LGTM!

The checkHTTPStatusAll function correctly verifies the response from the /health endpoint without any query parameters. The changes improve the reliability of the test by unmarshalling the JSON response and asserting the contents of the status.Report slice.


719-733: LGTM!

The checkHTTPReadinessAll function correctly verifies the response from the /ready endpoint without any query parameters. The changes improve the reliability of the test by unmarshalling the JSON response into a slice of status.Report structs and asserting the response status code.

jobs.go Outdated Show resolved Hide resolved
jobs.go Outdated Show resolved Hide resolved
ready.go Outdated Show resolved Hide resolved
ready.go Show resolved Hide resolved
Signed-off-by: Valery Piashchynski <[email protected]>
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Outside diff range and nitpick comments (1)
health.go (1)

33-35: Consider setting the report slice capacity based on the expected number of plugins.

The current capacity of 2 seems arbitrary. To avoid unnecessary allocations, set the capacity based on the expected number of plugins.

For example, if the expected number of plugins is known, you can set the capacity as follows:

report := make([]*Report, 0, expectedPluginCount)

If the expected number of plugins is unknown, you can use the length of the statusRegistry map as a reasonable estimate:

report := make([]*Report, 0, len(rd.statusRegistry))
Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 2d97bc0 and 8546b5e.

Files selected for processing (3)
  • health.go (2 hunks)
  • jobs.go (2 hunks)
  • ready.go (2 hunks)
Files skipped from review as they are similar to previous changes (1)
  • ready.go
Additional comments not posted (8)
jobs.go (4)

45-46: Suggestion implemented: Using len(jobStates) as the capacity for the report slice.

The code has been updated to set the capacity of the report slice to len(jobStates), as suggested in the previous review. This ensures that the slice has enough capacity to store all the job reports without the need for reallocation.


49-60: LGTM: Structured approach for job state reporting.

The code segment populates the report slice by iterating over jobStates and creating JobsReport instances. This change enhances code clarity and maintainability by using a structured approach to represent job state information. The implementation looks good.


62-66: LGTM: JSON serialization and error handling.

The code segment serializes the report slice into JSON format using json.Marshal, which is a standardized format that is easier for clients to consume. Error handling has been introduced to ensure that any issues during the JSON marshaling process are logged appropriately. The implementation looks good.


68-70: Suggestion implemented: Handling the error returned by w.Write.

The code has been updated to handle the error returned by w.Write, as suggested in the previous review. This ensures that any issues during the write operation are logged appropriately.

health.go (4)

38-106: LGTM!

The code segment handling the scenario where no plugins are provided in the query looks good. It comprehensively checks the health of all registered plugins, handles various error scenarios, and provides a structured JSON response. The error handling and logging are also implemented appropriately.


111-163: LGTM!

The code segment handling the scenario where specific plugins are provided in the query looks good. It checks the health of the specified plugins, handles various error scenarios, and provides a structured JSON response. The error handling and logging are also implemented appropriately.


165-168: LGTM!

The code segment for serializing the report slice to JSON looks good. It properly marshals the data and logs any errors that occur during the process.


170-174: LGTM!

The code segment for writing the serialized JSON response to the response writer looks good. It properly writes the data and logs any errors that occur during the process.

@rustatian rustatian merged commit dc69b41 into master Sep 16, 2024
7 checks passed
@rustatian rustatian deleted the feature/1998-simplify-health-checks branch September 16, 2024 07:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[💡 FEATURE REQUEST]: Simplifying the health check of the entire Application Server
1 participant